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Abstract  The  solution  of  symmetric  linearized  systems  of  equations  arising 
from  applications  of  the  finite  element  method  to  nonlinear  analysis  of  structures  is 
considered.  Two  related  schemes  that  are  based  on  Krylov  sequences  (e.g., 
Lanczos  and  conjugate  gradient  procedures)  are  examined.  The  conjugate  gradient 
procedure  is  derive  directly  from  the  Lanczos  process.  In  this  derivation  it  is  dem¬ 
onstrated  that  the  conjugate  gradient  implicitly  computes  the  triangular  factors  of 
the  reduced  tridiagonal  system  generated  by  the  Lanczos  process.  The  stability  of 
the  conjugate  gradient  procedure  depends  on  that  for  the  triangular  factorization 
that  can  be  guaranteed  only  for  positive  definite  systems. 

Loss  of  orthogonality  among  the  Lanczos  vectors  is  also  addressed  and  the 
method  of  partial  reorthogonalization  is  used  to  maintain  orthogonality.  This 
approach  improves  the  robusmess  of  the  algorithm  but  can  be  expensive  for  ill- 
conditioned  systems.  For  such  problems,  preconditioning  can  help  reduce  the 
number  of  iterations  required  for  convergence.  Three  different  preconditioning 
schemes  are  considered;  diagonal,  element-by-element  (E  x  E),  and  substructure- 
by-substructure  (S  X  S). 

The  above  methods  were  applied  to  a  number  of  different  example  problems. 
The  S  X  S  method  is  more  effective  than  either  the  diagonal  or  the  E  x  E  methods  at 
reducing  both  the  computation  cost  and  the  number  of  iterations.  For  ill-condi¬ 
tioned  systems  the  computation  cost  for  E  x  E  was  less  than  that  for  diagonal 
preconditioners  but  as  the  condition  number  improved,  the  cost  for  the  two  methods 
were  about  the  same. 
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1.  Introduction 


In  this  study  we  consider  the  solution  of  symmetric  linear  systems  of  equations  arising  from  appli¬ 
cations  of  the  finite  element  method  using  Krylov  based  algorithms  (e.g.  Lanczos  and  conjugate  gradient 
procedures).  The  Lanczos  algorithm  [  1  ]  was  first  introduced  in  1 950  as  a  method  for  computing  eigenvalues 
and  the  corresponding  eigenvectors  of  a  matrix.  In  1952  Hestenes  and  Stiefel  [2]  introduced  the  method  of 
conjugate  gradients  (CG)  for  solving  linear  systems  of  equations.  In  the  same  year,  Lanczos  showed  that 
his  algorithm  then  called  the  method  of  minimized  iteration  [3],  can  also  be  used  to  obtain  the  solution  of 
a  linear  system  of  equations.  In  fact,  these  methods  are  closely  related  in  the  sense  that  in  exact  arithmetic 
(when  no  roimdoff  errors  are  present)  they  compute  the  same  rqrproximate  solution  at  each  step.  A  fact 
known  to  both  Lanczos  and  Hestenes. 

An  important  motivating  faaor  for  using  Lanczos  and  CG  methods  is  the  theoretical  result  showing 
that  in  exact  arithmetic  both  methods  are  able  to  compute  the  solution  in  less  than  n  iterations,  where  n  is 
the  number  of  equations  in  the  system.  In  fact  the  number  of  iterations  will  be  less  than  the  total  number  of 
distinct  eigenvalues  in  the  system.  In  1960,  the  CG  method  was  first  used  to  solve  system  of  linear  equations 
arising  in  structural  mechanics  [4].  In  this  paper.  Lively  showed  that  CG  was  not  effective  for  solving  the  ill- 
conditioned  systems  that  often  arise  in  structural  analysis.  The  popularity  of  the  CG  method  vanished  when 
it  was  found  that  for  certain  problems  it  required  well  over  n  steps  to  converge  to  the  correct  solution.  CG 
was  then  abandoned  and  the  more  effective  direct  methods  based  on  triangular  factorization  of  the  matrix 
were  adopted  by  structural  analysts  as  their  method  of  choice. 

Nonlinear  transient  finite  element  problems  may  be  characterized  by  the  equations  of  dynamic  equilib¬ 
rium 

+  =  (1) 

where  M  is  the  mass  matrix,  f is  the  vector  of  internal  resisting  forces  due  to  the  displacements  w,  and 
f is  the  time  dependent  external  force  vector  [5].  The  above  system  is  generally  solved  by  applying  a 
step-by-step  time  integration  procedure  resulting  in  a  system  of  nonlinear  algebraic  equations.  The  solution 
to  this  system  is  obtained  using  a  Newton-Raphson  iteration  or  related  schemes.  At  the  heart  of  this  iteration 
is  a  set  of  linear  equations 

Ax  =  b  (2) 


where  x  is  the  correction  to  the  approximate  solution  vector  in  the  nonlinear  iteration  loop.  A  is  sym¬ 
metric,  positive  definite  and  sparse  and  is  related  to  the  system  in  (1)  through 


A  =  M  -I-  K5 


M'nt 


(3) 


/V*"’ 

where  K  =  and  5  is  a  scaler  parameter  that  depends  on  the  time  step  (e.g.  6  =  |At^).  For 
problems  with  small  bandwidth  or  problems  which  result  in  small  fill-in  in  the  triangular  factorization  of 


A,  direct  methods  are  the  fastest  solvers  [6].  When  the  bandwidth  of  A  is  large  (e.g.  large  complex  two 
and  three  dimensional  problems),  the  solution  of  the  linear  system  may  be  a  formidable  task  and  alternative 
procedures  must  be  considered. 

In  1971  the  advantages  of  Lanczos  and  CG  methods  were  recognized  when  attention  was  focused  on 
linear  systems  with  large  sparse  matrix  coefficients,  see  [7].  An  important  property  of  these  methods  is  that, 
at  each  step,  the  solution  to  (2)  can  be  obtained  with  out  an  explicit  knowledge  of  the  matrix.  Only  a  means  of 
computing  the  matrix  vector  product  Av  for  a  given  vector  v  is  required.  This  is  an  elegant  way  of  exploiting 
the  sparsity  structure  of  A.  Topically,  in  finite  elem.ent  analysis  there  are  fewer  than  100  nonzero  terms  in 
each  row  of  A.  The  number  of  nonzero  terms  in  A  is  independent  of  the  number  of  equations.  It  depends  on 
the  number  of  nodes  per  element,  the  number  of  parameters  per  tKxle,  and  the  number  of  elements  attached 
to  a  node.  The  number  of  equations  in  this  system  depends  on  the  complexity  of  the  structure  (i.e.  the 
domain)  and  also  on  the  amount  of  detail  desired  in  describing  the  solution. 

Preconditioning  is  introduced  to  alleviated  the  difficulties  associated  with  the  slow  convergence  of 
Lanczos  and  CG  methods.  Instead  of  solving  (2)  one  solves 

AB-'y  =  b  (4) 

where  x  =  B'^y.  B  is  referred  to  as  the  preconditioning  matrix.  The  advantages  of  preconditioning  can 
also  be  realized  by  solving  B~*  Ax  =  B~‘b.  The  number  of  iterations  required  to  solve  (4)  depends  on 
the  condition  number,  K,  of  its  coefficient  matrix.  k(AB~^  )  is  the  ratio  of  the  largest  eigenvalue  of  the 
eigenproblem  [A  -  A.B]z  =  0  to  its  the  smallest  (i.e  K  =  ((AB“  ^  ((  ((BA"  ^  (().  Theoretical  considerations 
suggest  that  at  the  end  of  each  of  the  first  few  iterations  of  both  Lanczos  and  CG  methods  the  residual  norm 
r/<(AB-‘)-. 

is  reduced  by  a  factor  of  -  .  -  ■  ■■  .  Note  that  when  ic  is  unity  a  single  iteration  is  sufficient  to  solve  the 

7k(AB^ 

equation.  Of  course  k  is  one  only  when  B  =  A!  However,  this  provides  us  with  a  guideline  for  choosing  B. 
B  must  be  chosen  such  that  one  can  easily  compute  the  solution  of  a  linear  system  of  equations  with  B  as 
its  matrix  coefficient  while  at  the  same  time  it  is  as  close  to  A  as  possible.  For  non-trivial  matrices  A,  these 
are  contradictory  requirements  which  makes  the  problem  of  finding  good  preconditioners  a  challenging  one. 
For  a  well  chosen  B  only  a  few  iterations  is  required  to  reduce  the  residual  norm  to  the  desired  level.  It  is 
important  to  note  that  the  condition  number  of  AB~  ^  depends  on  the  time  step  through  5  in  equation  (3)  as 
the  following  example  demonstrates. 

Example  1: 


LetK  = 


1  +  C  -1 

-1  1  +  c 


and  M  = 


1  0 
0  1 


.  Then 


A  = 


1  +  (1  +  C)5 

-5 


2 


-5 

l  +  (l  +  C)5 


The  eigenvalues  of  the  preconditioned  system  satisfies  the  quadradc  equation  det[A  -  XB]  =  0.  There  is  no 
change  in  the  condition  number  for  this  problem  with  diagonal  preconditioning  since  the  diagonal  of  A  is  a 
scalar  multiple  of  the  identity  matrix.  Then  the  condition  number  for  this  problem  becomes 

If-  1  +  +  C)S 

1  +  C5 

For  a  sufficiently  small  time  step  the  condition  number  is  close  to  unity.  On  the  other  hand  as  the  time  step 
increases  the  condition  number  tq)proaches  1  +  ^  which  may  be  aibitraiily  large  for  small  C- 

This  example  demonstrates  that;  (a)  the  performance  of  iterative  methods  for  solving  linear  systems 
of  equations  arising  from  traitsient  fiiute  element  problems  depends  strongly  on  the  time  step,  and  (b)  for 
a  given  finite  element  discreatization  static  problems  result  in  worse  conditioned  system  of  equations  than 
transient  problems.  These  facts  should  be  considered  when  assessing  the  performance  of  iterative  methods. 

Here,  we  focus  on  systems  which  arise  from  the  application  of  the  finite  element  method  to  engineering 
problems  whose  sparsity  structure  may  be  characterized  by 

A  =  N.aX  (5) 

e 

where  is  long  and  thin  Boolean  cormectivity  matrix  and  denotes  the  small  stifhess  matrix  for  element 
e.  We  can  take  advantage  of  this  structure  of  A  when  using  either  CG  or  Lanczos  methods  to  solve  (2).  In 
[8]  and  [9],  it  is  pointed  out  that  the  matrix  vector  product 

Au  =  5Z(NeaeN^u)  (6) 

e 

can  be  computed  without  ever  assembling  A.  The  evaluation  of  Au  using  (6)  requires  more  arithmetic 
operations  than  that  using  an  assembled  A  (assuming  some  compact  structure  where  no  zero  entries  of  A 
are  stored).  Typically,  the  number  of  arithmetic  operations  would  increase  by  about  three  folds.  However, 
the  use  of  parallel  and  vector  computers  produces  only  a  modest  increase  in  the  elapsed  time  and  in  certain 
cases  might  even  reduce  it. 

In  1983,  Hughes,  Levit  and  \^nget  [10]  proposed  a  time  integration  algorithm  for  the  solution  of  heat 
conduction  equations  that  uses  an  element-by-element  (ExE)  splitting.  In  the  same  year,  Ortiz,  Pinsky  and 
Taylor  [11]  proposed  a  novel  extension  of  the  ExE  procedure  to  the  solution  of  dynamic  equations.  These 
ExE  time  integration  algorithms  are  unconditionaUy  stable,  but  they  lack  accuracy  which  limits  their  use. 
Hughes,  Levit  and  Winget  [12]  reformulate  the  Ex  E  procedure  as  an  iterative  solver  to  achieve  the  accuracy 
and  stability  of  standard  finite  element  algorithms.  In  [13]  Nour-Omid  and  Parlett  addressed  the  problem 
of  preconditioning  (2).  The  idea  is  to  employ  the  methods  for  solving  differential  equations  presented  in 
[  10,1 1 ,12]  as  preconditioners.  The  resulting  preconditioners  use  the  element  representation  of  A  in  (6),  and 
requires  ik)  globally  assembled  matrix.  They  are  defined  as  the  product  of  positive  definite  element  matrices 
obtained  by  applying  a  diagonal  shift  to  the  positive  semi-definite  element  stifthess  matrices.  Winget  and 


Hughes  [14]  fiiither  developed  the  ideas  of  element  pteconditioners  and  constructed  a  variation  that  replaces 
the  terms  on  the  diagonals  of  the  element  matrices  with  the  corresponding  ones  in  the  assembled  matrix.  This 
modification  also  results  in  positive  definite  element  matrices.  A  prxxluct  algorithm  similar  to  that  in  [  1 1  ]  is 
then  constructed  using  the  Choleski  factorization  of  these  modified  element  matrices.  It  is  worth  noting  that 
these  pteconditioners,  though  computed  element-by-element,  are  an  approximate  Choleski  factorization  of 
A,  see  [  1 S].  Our  primary  interest  in  element-by-element  preconditioning  is  in  keeping  storage  requirements 
down  in  the  analysis  of  regular  structures,  but  the  advent  of  vector  and  parallel  computing  may  make  this 
approach  a  fast  one  as  well,  especially  in  three  dimensions. 

In  section  2  we  briefly  describe  the  Lanezos  algorithm  and  present  a  derivation  of  the  conjugate  gra¬ 
dient  method  from  the  Lanezos  algorithm  in  section  3.  We  then  turn  to  the  problem  of  orthogonality  loss 
that  affects  both  methods.  As  a  remedy,  we  consider  the  approach  of  partial  reorthogonalization  proposed 
by  Simon  [16].  The  element  pteconditioners  used  in  this  study  are  described  in  section  5  together  with  a 
discussion  of  some  implementation  issues.  In  section  6  we  describe  the  S  x  S  pteconditioners  and  its  relation 
to  Ex  E  method.  The  results  of  numerical  tests  on  two  characteristically  different  problems  are  presented  in 
section  7. 
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2.  Lanczos  Algorithm 

When  used  as  a  method  for  solving  linear  systems,  the  Lanczos  process  starts  from  a  given  initial 
approximation  to  the  solution,  Xq.  Associated  with  Xq,  define  the  residual  vector  Tq  =  b  -  Axq.  Unless  a 
good  estimate  to  the  desired  solution  is  available,  the  best  choice  for  Xo  is  the  zero  vector.  Then  Tq  =  b. 
Normalizing  Tq  gives  the  first  Lanczos  veaor,  .  Implicitly,  at  the  end  of  the  first  step  the  algorithm  obtains 
a  Galeikin  approximation  to  the  solution  of  (2)  from  the  one  dimensional  space  wim  as  the  base  vector. 
Associated  with  this  approximation  is  a  new  residual  vector.  The  normalization  of  this  residual  results 
in  the  second  Lanczos  vector,  q^.  Repeating  the  Galerkin  process  but  using  the  two  dimensional  space, 
span(q  j ,  q^ ),  followed  by  the  normalization  of  the  residual  one  obtains  q3 . 

At  a  typical  step,  j,  the  Lanczos  algorithm  computes  a  residual  vector  associated  with  a  best  s^proxi- 
mation,  Xj ,  (in  a  Galerkin  sense)  to  x  from  the  j  dimensional  space,  span[qj ,  qj , q^  ].  This  space  is  often 
referred  to  as  the  space  of  trial  vectors.  The  next  Lanczos  vector,  q^  ^  i ,  is  the  normalized  residual  b  -  Ax^ . 
This  process  is  repeated  until  the  norm  of  the  current  residual  is  small  compared  to  the  that  of  the  starting 
residual.  The  Galerkin  method  chooses  the  approximate  solution  Xj  by  forcing  the  associated  residual  to 
be  orthogonal  to  the  space  of  trial  (Lanczos)  vectors.  The  Lanczos  method  computes  neither  x^  nor  the 
associated  residual.  Irtstead  it  computes,  at  step  j,  a  j-vector  Sj  that  contains  the  weighting  parameters  for 
constructing  the  Galerkin  solution. 

Alternatively,  Lanczos  may  be  described  as  the  Gram-Schmit  orthogonalization  process  applied  to  the 
Krylov  space,  [ro,  AB~'ro,(AB"‘)^ro,  ...,(AB"')’"‘ro],  associated  with  equation  (4).  The  orthogonal¬ 
ization  is  performed  with  respect  to  the  B~  ^  inner  product.  The  result  of  this  orthogonalization  is  the  set 
of  Lanczos  vectors  [qi,q2i  is  obtained  by  orthonormalizing  (AB~  ’  y  Tq  against  the  computed 

Lanczos  vectors.  The  same  vector  q^^^  is  obtained  if  AB~^q^  is  used  instead  of  (AB~‘y  ro.  It  turns  out 
that  the  components  of  AB~  ^  q^  along  the  first  j  -  2  Lanczos  vectors  are  zero  and  orthogonalization  needs 
to  be  performed  only  against  q^  and  q^  _  ^  ■  The  result  is  a  vector  in  the  same  direction  as  the  residual  due 
to  the  Galerkin  approximation  described  above. 

The  algorithm  can  then  be  rewritten  as  the  three  term  relation 

r,  =  P,  +  ,q,  +  ,  =  AB-‘q.  -  q^a,  -  q^.^p.  (7) 

where  ttj  =  qj  B~  ^  AB~  *  q^  and  is  normalized  with  respect  to  the  inverse  of  the  preconditioner  to  obtain 

q^-^i  with  normalizing  factor  p^.^j  =  (rTB^'r^)*. 

The  j-th  step  of  the  Lanczos  algorithm  involves  the  calculation  of  a^,  P^^j.  and  q^^j,  in  that  order. 
In  addition  to  the  storage  needs  for  A  and  B,  the  algorithm  requites  storage  for  S  vectors  of  length  n;  one 
for  each  of  the  vectors,  q^  _  i .  q^ .  fj  ,Pj  =  B“ '  q^  and  p^  _  j .  The  total  cost  for  one  step  of  the  algorithm 
involves  one  solve  with  the  preconditioner  B  as  the  coefficient  matrix,  a  multiplication  of  A  by  a  vector,  two 
inner  products  and  four  products  of  a  scalar  by  a  vector.  A  summary  of  the  Lanczos  algorithm  is  presented 
in  Table  1. 


After  m  Lanczos  steps  all  the  quantities  obtained  ftom  equation  (7)  can  be  arranged  in  a  global  matrix 


form 


(8) 


Here  =  (0, 0, 0, 1),  is  an  n  x  m  matrix  with  columns  q,-, »  =  1,2, ...,  m,  and  is  the 
tridiagonal  matrix 

r«i  P2 

p2  «2  p3 
p3  .  • 


L  ■ 

The  orthogonality  property  of  the  Lanczos  vectors,  where  is  the  m  x  m  identity 

matrix,  can  be  used  in  equation  (8)  to  obtain 


q;:;b-^ab->q, 


=  T 

““  *  T» 


(10) 


A  Galericin  approximation  to  y  in  (4)  can  be  constructed  by  taking  a  linear  combination  of  the  Lanczos 
vectors.  Accordingly, 

Vm  ~  Qm®"!  (11) 

where  Sm  satisfies  the  tridiagonal  system  of  equations 

=  (12) 

The  last  equality  is  obtained  using  the  fact  that  the  starting  vector  is  Tq  =  pj  q^ .  ei  is  the  first  column  of  the 
identity  matrix.  Equation  (12)  is  a  weak  form  of  (4)  and  is  obtained  by  first  substituting  the  approximation 
to  y  ftom  (11)  into  equation  (4)  to  obtain  the  residual 


gm  =  AB-‘Q„s,„ -b  (13) 

Orthogonalizing  g„,  against  Q„,  with  respect  to  the  B~ '  inner  product  results  in  equation  (12).  is 
simply  related  to  t„  through 

gm  =  '■mO„  (14) 

where  Om  is  the  bottom  element  of  .  The  norm  of  this  residual,  P„,  =  llg^  II  =  Pm + 1  l^n>  I  • 
monitor  the  convergence.  Once  is  sufficiently  small  the  Lanczos  algorithm  is  terminated  and  Uk  solution 

is  constructed  using  (11).  The  Lanczos  vectors  can  be  put  on  secondary  storage  as  they  are  being  generated. 
There  are  two  main  reasons  for  keeping  the  Lanczos  vectors. 
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(a)  They  are  used  occasionally  in  subsequent  steps  to  restore  onhogtmality  (see  the  following  section  on 
Loss  of  Oithogonality  for  more  details). 

(b)  They  can  be  recalled  and  used  to  construct  the  solution  to  a  new  right  hand  side  [17].  The  algorithm  in 
Table  1  is  particularly  well  suited  for  multiple  right  hand  sides. 

Given  an  !q>proximate  soluticm  vector  Xo : 

(1)  Set 

(a)  To  =  b  -  Axo, 

(b)  qo  =  0. 

(c)  Solve  Bpj  =  Fq. 

(d)  Pi  =  (pr«’o)’. 

(c)  <il  = 

(0  Pi  =  ^Pi 

(2)  for  j  =  1,2, ...  repeat; 

(a)  tj  =  Ap^ 

(b)  T;  -qy-iPi 

(c)  =  qjB-ifj  = 

(d) 

(e)  Solve  Bpj  =  Vj 

(0  =(rjB-‘r,)i  =  (pJr,)i 

(g)  if  residual  norm  is  small  then  terminate  the  loop. 

(b)  Qj+i  = 

(i)  P;  +  l  =  p^P; 

(3)  Solution  X  =  Xo  +  B”  *  Q„  s,„ 

Table  1.  The  Lanczos  algorithm. 
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3.  Conjugate  Gradient  Algorithm 

In  this  section  we  give  a  derivation  of  the  conjugate  gradient  algorithm  directly  from  the  Lanczos 
process  [18,19,20].  The  conjugate  gradient  method  can  be  viewed  as  a  procedure  that  implicitly  computes 
the  triangular  factorization  of  through  an  update  algorithm  to  combine  the  steps  2  and  3  of  the  Lanczos 
algorithm  given  in  table  1.  Accordingly 

T„  =  (15) 


where  Dm  =  diag[bi,  83, . . . ,  5^]  and 


r  1 

-tOi  1 

-0)2  1 


Lm  = 


1 


(16) 


-tOm-l  1- 

The  components  of  T^ ,  ,  and  Dm  are  related  through  the  following  pair  of  equations. 

5*  =  Ofc  -©LiS/k-i 


=  - 


(17) 


These  two  equations  completely  define  the  algorithm  for  triangular  factorization  of  Tm .  Next  we  define 


Zm  =B-‘QmL; 


(18) 


An  important  property  of  Zm  is  that  its  columns  are  orthogonal  with  respect  to  A.  To  show  this  consider 

Z5;;AZm  =  L-‘Q„B-‘AB-iQmL-^ 


—  1  -lx  I 

=  Dm 


'Die  columns  of  Zm  are  said  to  be  conjugate  and  the  orthogonality  condition  of  Zm  with  respect  to  A  is 
referred  to  as  the  conjugacy  condition. 

Multiplying  both  sides  of  equation  (18)  by  Lm  and  using  the  bi-diagonal  structure  of  Lm  to  equate  the 
k-th  column  on  either  side  of  this  equation,  yields 

Zi  -  =  B-'q^  (19) 

Defining  d^  =  B~  the  above  equation  reduces  to 


z*  =  d/k  +  Zt_  I 


(20) 
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Due  to  the  conjugacy  property  of  we  are  able  to  update  the  solution  vector  Xt  by  simply  adding  a 
component  of  Zt .  Thus,  using  equation  (4)  we  have 


X*  =B-iy* 

=  Z*D;‘L;‘Piei 
=  *t-i  +YtZi 

where  y*  is  the  fc-th  element  of  D~  ‘  L;;*  p^Ci .  This  way  the  residual  vector  can  ^so  be  updated  using  the 
update  relation 


gt  =  gk-i  -YfcUk 


(21) 


and  u*  =  AZfc .  y*  is  related  to  the  component  of  D„  and  through 


Yt  = 


E* 

5* 


(22) 


where  p,^  is  updated  through 


P*  — 


(23) 


with  pj  =  Pi .  The  above  equation  is  simply  the  forward  reduction  algorithm  to  compute  L“  *  Pj  Ci . 

Thus  the  CG  method  directly  computes  the  triangular  factors  of  by  updating  the  factors  of  T„,  _  i . 
The  result  is  the  algorithm  in  Table  2.  It  is  important  to  note  that  T,„  is  often  indefinite  when  A  is  a  sym¬ 
metric  indefinite  matrix.  In  this  case  the  conjugate  gradient  algorithm  is  not  reliable  since  the  triangular 
factorization  of  T^  may  be  numerically  unstable.  This  instability  occurs  when  ever  8t  is  small.  Note  that 
8t  =  zj  Ut  is  the  denominator  of  the  right  hand  side  of  (2b)  in  Table  2. 

The  CG  algorithm  generates  a  sequence  of  approximations,  X;^ ,  to  the  solution  x  with  a  corresponding 
residual  vector  .  The  termination  criterion  can  be  chosen  based  on  these  quantities.  In  addition  to  storage 
demartds  for  A  and  B  the  algorithm  requires  storage  for  4  vectors. 
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Given  an  approximate  solutitm  Xq  then: 

(1)  Set 

(a)  go  =  b  -  Axo 

(b)  Zi  =  go 

(c)  Solve  Bdi  =  go 

(d)  P,  =  gjdi 

(2)  for  fc  =  1, 2, ...  repeat; 

(a)  Ui  =  AZfc 

(b)  Y/b  = 

(c)  Xi=Xi_i+YtZi 

(d)  gt=gt_i-YtUi 

(e)  Solve  Bdt+i  =  gj. 

(0  P*+i=gId*+, 

(g)  if  Pt+ 1  <  tol  ■  po  then  tenninate  the  loop. 

(h)  <0*  =  ^ 

(i)  z*+i  =  di+i  +  ©tZ* 


Table  2.  The  Conjugate  Gradient  Algorithm 


4.  Loss  of  Orthogonality 

bi  finite  precision,  each  computation  introduces  a  small  error  and  therefore  the  computed  quantities 
wiU  differ  from  their  exact  counterparts.  Our  objective  here  is  to  state  the  effect  of  roundoff  error  on  the 
Lancxos  process.  For  this  purpose  we  denote  by  e  the  smallest  number  in  the  computer  such  that  1  +  o  1- 
It  is  known  as  the  unit  roundoff  error. 

Although  the  tiidiagonal  relation,  Eq.  (8),  is  preserved  to  within  roundoff,  the  B~  ^  oithogoruility  prop¬ 
erty  of  the  Lanczos  vectors  completely  breaks  down  after  a  certain  number  of  steps  depending  on  e  and 
the  distribution  of  the  eigenvalues  of  B~  ^  A  [16,19].  The  Lanczos  vectors  not  only  lose  their  orthogonality, 
but  may  even  become  linearly  dependent.  This  problem  also  effects  the  conjugate  gradient  method  in  the 
form  of  loss  of  conjugacy.  A  direct  consequence  of  this  loss  of  orthogonality  is  delay  in  convergence  to  the 
desired  solution. 

The  loss  of  orthogonality  can  be  viewed  as  the  subsequent  amplification  of  the  errors  introduced  after 
each  computation.  We  let  denote  the  computed  Lanczos  vectors  and  define  the  following  matrix 

H„=Q3:B-‘Q„  (24) 

In  exaa  arithmetic  Hm  is  the  identity  matrix.  The  off-diagonals  of  H,„  will  depend  on  e,  the  unit 
roundoff  error.  Simon  [16]  found  a  recurrence  relation  that  can  be  used  to  estimate  the  elements  of  a  column 
of  from  the  elements  of  and  the  elements  in  the  previous  columns  of  .  This  recursion  can  be 
stated  in  veaor  form 

P;+ihi+i  »  T^-ih^  -  a;h,  -  p^h,.i  (25) 

where  hj  _  1 ,  hj  and  h^ + 1  are  vectors  of  length  j  -  1  containing  the  top  j  -  I  elements  of  the  j  -  1 ,  j , 
and  j  -f-  1-th  columns  of  (H,^  -  Im  )•  Here,  the  bottom  element  of  h,  _  i  is  e.  The  orthogonality  state  can  be 
monitored  by  updating  +i  in  the  course  of  the  Lanczos  algorithm. 

A  number  of  preventive  measures  can  be  taken  to  maintain  a  certain  level  of  orthogonality.  Lanczos 
was  aware  of  the  effects  of  roundoff  on  the  algorithm  when  he  presented  his  work.  He  proposed  that  the 
newly  computed  vector,  qy.|.  i ,  be  explicitly  orthogonalized  against  all  the  preceding  vectors  at  the  end  of 
each  step.  We  will  refer  to  this  technique  as  “full  teorthogonalization”  method.  It  enforces  orthogonality  to 
within  roundoff  (i.e.  |qfB~‘q^|  <  nE,t  ^  j).  In  [16]  Simon  showed  that  the  computed  tridiagonal  remains 
accurate  to  within  roundoff  if  the  more  relaxed  orthogonality  condition 

|qJ’B-‘q^|  <  v^,tV  J  (26) 

is  enforced.  We  refer  to  this  as  the  semi-orthogonality  condition,  and  to  procedures  that  adopt  the  weaker 
condition  as  selective  orthogonalization  methods.  Simon  proposed  to  update  ^  i  using  (26)  and  monitor 
the  magnitude  of  its  elements.  Whenever  any  component  of  i  is  greater  than  y/t  then  semi-orthogorulity 


may  be  lost  between  and  some  columns  of  Q^-.  At  this  step  the  appropriate  Lanczos  vectors  are  brought 

in  from  secondary  store  and  ^  j  is  oithogonalized  against  each  of  them.  This  operations  must  be  carried 
out  in  two  successive  steps  to  avoid  propagation  of  the  errors. 

A  variant  of  Simon’s  scheme  is  to  restore  orthogonality  of  q^  and  ^ .  at  the  same  time.  In  this  way 
noreorthogonalizationofq^.^2  will  be  necessary,  at  the  end  of  the  next  step.  The  number  of  operations  for 
this  scheme  is  the  same  as  that  of  the  scheme  above,  but  vectors  are  retrieved  only  once  and  therefore  the 
I/O  overhead  is  halved. 

The  disadvantage  of  reorthogonalization  is  that  additional  storage  is  required  to  keep  the  Lanczos  vec¬ 
tor.  If  m  steps  are  required  to  reduce  the  residual  to  the  desired  level  then  storage  for  m  vectors  of  length 
n  is  needed.  The  advantage  is  that  reorthogonalization  can  significantly  reduce  the  number  of  steps.  'Dus  is 
demonstrated  by  the  numerical  examples. 
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5.  Element  by  Element  Preconditioning 

The  idea  of  using  solution  algorithms  firom  discretized  partial  differential  equations  for  constructing 
preconditioners  is  not  new.  As  early  as  1963  Wachspress  proposed  one  such  preconditioner  based  on  the 
ADI  method  [21].  Recently,  in  [13]  we  proposed  a  number  of  preconditioners,  based  on  the  element-by¬ 
element  representation  of  A  for  solving  Mx  Ax  =  b.  Here  M  is  a  diagonal  matrix.  The  steady  state 
solution  of  this  equation  is  the  same  as  that  of  (2). 

The  first  preconditioner  uses  a  Choleski  factorization  of  the  element  matrices  shifted  by  a  diagonal 
matrix.  We  denote  the  diagonally  scaled  A  by 

n. 

A  =  M-  >  AM"  ^  N,a.Nj  (27) 

e=l 

where,  Tit  is  the  total  number  of  elements.  Then  the  proposed  preconditioner  can  be  constructed  in  the 
following  manner.  First,  the  Choleski  factors  c<cf  =  ol  -|-  is  computed.  The  shift  was  applied  to 
eliminate  the  singularity  of  a^.  A  lower  triangular  matrix,  C  is  formed  as  the  product  of  the  Ct ’s.  C  is  an 
approximation  to  the  Choleski  factorization  of  A.  The  resulting  preconditioner  is  given  by 

n«  1 

B-*  =  M>  n  N,c;*Nf  n  NeC7^NfMa  (28) 

e=l  e=n. 

Note  that  the  second  product  is  carried  out  in  the  reverse  order  of  the  first.  Numerical  results  indicated  that 
a  shift  0=1  results  in  a  preconditioner  that  is  close  to  the  optimum. 

Writing  it  =  Uc  -I-  oj  -i-  de ,  where  d«  and  Q«  denote  the  diagonal  and  strict  upper  triangular  part  of  3; , 
a  second  preconditioner  was  constructed  as 

n,  I 

B-i  =  M>  nN.(I  +  d.-HQ[)-*Nj  n  N.(I-|-d. -|-u.)-'N[M>  (29) 

e=I  «=T», 

A  comparison  of  these  two  preconditioners  on  small  problems  indicated  that  the  Choleski  form  is  more 
effective. 

Ex  E  Choleski: 

In  [14]  Winget  replaced  the  diagonal  of  with  the  idoitity  matrix  to  form 

B-'  =  fi  N*c7X  n  N.e:^N[Ma  (30) 

e=l  e=n, 

where  =  I  +  Qe  -1-  of  -  This  avoids  the  artificial  shifting  of  a,. 
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ExE  LU  Split: 

Similarly,  by  dropping  d«  in  (29),  a  new  but  simpler  preconditioner 

B  =  MinN,(I  +  or)Nf  ri  N.(I+Q.)NfM^  (31) 

«=1  «=n, 

can  be  constructed  that  eliminates  the  need  for  forming  Choleski  factorization  of  element  matrices.  The  main 
advantage  is  the  reduction  in  storage  since  B~ '  v  can  be  evaluated  for  any  v  using  the  element  representation 
of  A.  In  the  next  section  we  compare  the  two  preconditioners  defined  in  (30)  and  (31)  using  both  Lanczos 
and  CG  methods. 

Implementation 

We  view  the  computation  of  the  matrix-vector  product  Au  via  equation  (6)  as  a  mechanism  for  saving 
storage  in  return  for  extra  arithmetic  work.  The  reduction  in  storage  demand  is  due  to  the  following: 

1 .  In  most  practical  finite  element  problems  there  is  a  considerable  amount  of  repetition  of  a  given  element 
in  the  mesh  stmcture. 

2.  The  element  matrices  of  a  number  of  element  types,  such  as  beams,  trusses,  etc.,  are  known  explicitly 
and  depend  on  only  a  few  fundamental  parameters. 

The  first  observation  allows  us  to  create  a  data  structure  which  keeps  the  element  matrix  of  one  element 
to  represent  a  whole  group  of  elements.  The  second  observation  results  in  a  canonical  form  for  each  element 
type,  and  therefore  only  a  few  parameters  need  be  stored  to  define  each  element  matrix.  Hence  the  storage 
requirements  for  all  the  distinct  a«  is  often  significantly  less  than  the  number  of  words  requited  to  hold  A, 
even  when  a  sophisticated  sparse  storage  scheme  is  used  (see  [6]).  Furthermore,  one  can  always  recompute 
the  element  matrices  a^  each  time  the  product  Au  is  requited. 

The  overhead  for  the  reduction  in  storage  is  the  increased  number  of  operations.  However,  two  com¬ 
ments  are  in  order: 

1 .  The  cost  of  a  multiply  no  longer  dominates  arithmetic  evaluations. 

2.  Vector  and  Parallel  computers  or  other  special  purpose  devices  can  execute  II,(NeaeNju)  very  effi¬ 
ciently. 

When  using  the  implicit  form  of  Au  it  can  be  seen  that  different  elements  operate  on  different  parts  of 
the  veaor  v.  One  can  take  advantage  of  this  faa  by  performing  some  of  the  element  matrix  operations  in 
parallel.  The  elements  are  simply  arranged  into  p  groups,  where  p  is  the  number  of  available  processors. 
Each  processor  then  computes  the  contribution  of  the  product  of  all  the  element  matrices  in  its  assigned  group 
by  the  corresponding  components  of  v.  Finally,  the  contribution  from  each  group  is  combined  to  obtain  Au. 
The  implicit  product  itKteases  the  cost  of  matrix  operations  by  a  factor  of,  say  p,  where  p  depends  on  the 
average  number  of  elements  connected  to  a  node.  Typically  p  range  between  1.5  and  3(13],  although  one 
could  design  examples  that  result  in  large  p. 
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6.  Substructure  by  Substructure  Preconditioning 

An  obvious  generalization  of  the  element  by  element  preconditioner  described  above  is  the  substruc¬ 
ture  by  substructure  (SxS)  preconditioner.  Here  the  finite  element  mesh  is  partitioned  into  a  number  of 
substructures  (also  referred  to  as  sub-domains  or  super-elements).  Each  substructure  consists  of  a  group 
of  elements.  Associated  with  each  substructure  one  can  define  a  stiffiiess  matrix.  The  assembly  of  the 
substructure  stifihess  matrices  results  in  the  global  matrix 


A  =  ^  N5A5NJ 


(32) 


5 

where  A5  denotes  the  stifihess  matrix  for  a  substructure.  This  is  a  generalization  of  the  element  by  element 
preconditioner  in  the  sense  that  for  the  special  case  when  each  substructure  consists  of  a  single  element, 
equation  (32)  reduce  to  (5). 

Following  a  similar  approach  to  the  development  of  ExE  preconditioner  we  are  required  to  perform 
some  form  of  factorization  with  A5 .  It  is  important  to  note  that  any  preconditioner  obtained  from  the  fac¬ 
torization  As  will  depend  on  the  ordering  of  the  urdoiowns  associated  with  the  parameters  in  the  given 
substructure.  By  adopting  a  special  ordering  where  the  interior  nodes  for  each  substructure  are  numbered 
first  one  can  arrive  at  the  hybrid  scheme  proposed  in  [23].  Then,  the  unknowns  associated  with  each  sub¬ 
structure  can  be  partitioned  as 


Xs 


=  {x1} 


(33) 


where  denotes  the  unknowns  in  the  interior,  and  xf  denotes  the  unknowns  on  the  boundary.  Here,  all 
quantities  associated  with  boundary  and  interior  unknowns  are  denoted  with  superscripts  B  and  I,  respec¬ 
tively.  The  terms  in  equation  (32)  can  be  expressed  as 


Ac  = 


ilIbT 


Af  1 

Af^ 


(34) 


and 


N5  =  [0,0.  .1,0  ...0,  Nf] 


(35) 


The  first  part  of  N5  consists  of  zero  blocks  except  for  an  identity  block  corresponding  to  the  interior  un¬ 
knowns.  The  right  hand  side  vector  may  also  be  partitioned  in  a  similar  manner  to  obtain 


lAf^th  the  above  ordering  the  linear  system  in  (2)  takes  the  form 


A1 


// 


A(«Nf^ 
Nf’ 


LNf  A(®' 


A" 


Nf  A(®^ 


A‘^ 

"2 


a" 

nIaV’ 


A^/Nf^ 

ABB 


»>{ 

*2 

*>'2 

>  =  - 

*5 

rB 

V  X  J 

(36) 


(37) 


15 


where  A®  ® ,  b® ,  and  x®  are  related  to  quantities  in  equations  (33),  (34),  (35)  and  (36)  through 

A®®  =ENfA®®N®^ 
s 

b^=ENfbf  (38) 

s 

X®  =  Nf ’’x® 

The  block  arrow  structure  of  the  coefficient  matrix  in  (37)  is  due  to  the  fact  that  the  interior  unknowns 
in  one  substructure  interaa  with  those  in  a  second  substructure  only  through  the  boundary  unknowns  x® . 
Eliminating  the  interior  unknowns  and  using  the  definitions  in  (38)  one  obtains  the  linear  system  of  equations 
for  the  boimdary  unknows 

;[]NfAf^N®"’  x®  =  ^Nf6f  (39) 

•  s  J  s 

where  bf  =  bf  -  A5®^(A”)  'b^  andAf^  =  Af  ®  -  A5®^(A5^)  ^A^®  is  the  Schur  complement  of 
a"  in  (34).  The  main  advantage  with  this  approach  is  that  the  matrix  in  (39)  has  a  better  condition  number 
that  the  original  system  (see  [23]  for  detail).  It  is  important  to  note  that  these  quantities  can  be  computed 
independently  of  the  other  substructures  and  therefore  completely  in  parallel. 

The  structure  of  matrix  coefficient  in  equation  (39)  is  similar  to  the  matrix  in  (5)  in  the  sense  that  they 
are  both  constructed  using  an  assembly  process.  The  only  difference  is  that  in  (39)  one  is  dealing  with  larger 
matrices.  This  similarity  may  be  used  to  construct  pteconditioners  for  equation  (39)  in  much  the  same  way 
as  in  section  5.  However,  the  difficulty  with  the  produa  form  for  the  preconditioners  defined  in  section  (S)  is 
the  sequential  nature  of  the  algorithm.  In  general,  the  product  form  in  equations  (30)  and  (31)  whoi  applied 
to  the  S  X  S  partition  requires  the  processing  of  substructure  one  at  a  time.  This  can  hinder  parallelism  when 
implemented  on  a  multi-processor  computers.  Although  there  are  schemes  designed  to  minimize  the  impact 
of  the  produa  form  on  parallel  implementation  (through  graph  coloring  algorithms),  when  the  number  of 
substructures  are  close  to  the  number  of  available  processors  these  schemes  are  not  effective. 

An  alternative  preconditioner  that  is  suitable  for  concurrent  implementation  can  be  construaed  using 
the  splitting  algorithm  proposed  in  [24-26]  for  transient  finite  element  analysis.  This  scheme  when  applied 
to  the  problem  in  (1)  results  in  an  algorithm  that  has  an  additive  form  and  thus  lends  itself  to  parallel  imple¬ 
mentation.  Thus,  the  new  preconditioner  takes  the  form 

B-‘=  ENfU®®’"’N®’'  D  ENru®®‘'N®  (31) 

■  s  J  I-  s 

where  D  is  diagotuQ  matrices  and  Uf  ®  is  the  lower  triangle  of  Af  ^ .  A  good  choice  for  D  is  the  diagonal 
of  the  coefficient  matrix  in  (39)  obtain  by  simply  assembling  the  diagonals  of  Af 

The  steps  in  evaluating  B"  *  v  for  a  given  veaor  v  is  similar  to  those  for  computing  the  product  pf  Af  ® 
and  a  vector.  First,  v  is  localized  to  each  substructure  through  Nf  v.  This  is  followed  by  a  step  of  forward 


16 


reduction  using  the  lower  part  of  Af  ^ .  Note  that  the  forward  reduction  is  performed  for  each  substructure 
independently  of  the  others.  The  result  is  then  assembled  to  obtain  a  new  vector  which  is  then  multiplied  by 
D.  A  second  localization  of  last  result  followed  by  a  biu:k-substituti(Mi  and  assembly  completes  the  operation 
with  the  preconditioners. 
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7.  Numerical  Examples 


In  this  section  we  discuss  the  results  obtained  from  the  application  of  the  Lanczos  and  CXj  methods 
to  two  different  example  problems.  The  hrst  example  is  a  cavity  driven  flow  problem  (Stokes  flow).  The 
incompressibility  of  the  fluid  is  represented  by  local  volumetric  constraints.  These  constraints  are  enforced 
in  each  finite  element  using  a  penalty  method.  The  penalty  parameter  represents  the  bulk  modulus  of  the 
fluid.  4(X)  elements  are  used  to  model  this  problem.  See  figure  1(a).  The  condition  number  of  A  increases 
with  the  penalty  parameter.  We  refer  to  this  as  material  ill-conditimiing. 

The  second  example  we  used  is  a  beam  in  pure  bending.  Taking  advantage  of  symmetry,  a  quarter 
of  the  beam  is  modeled  using  plane  stress  elements,  see  figure  1(b).  The  beam  was  analyzed  for  a  range 
of  different  thicknesses  while  keeping  the  length  constant.  This  way  the  element  aspect  ratio  (ratio  of  the 
largest  dimension  to  the  smallest)  can  be  varied.  Again,  the  condition  of  A  increases  with  the  aspect  ratio. 
We  refer  to  this  as  geometric  ill-conditioning.  Three  different  levels  of  mesh  refinement  were  used  to  study 
the  effect  of  problem  size  on  the  algorithms ;  4  x  16, 8  x  32  and  16  x  64. 


Lanczos  with  Partial  Reorthogonalization 
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_ 
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_ 
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Choleski 

Diagonal 
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Reorth. 

# 

Reorth. 

# 
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# 
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# 
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B 
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Cost* 

Iter. 
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Iter. 
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Iter. 

Iter. 

Iter. 
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17245 
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95424 
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10^ 

10® 
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11134 

184 

8900 
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10^ 

10^ 
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no 
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no 
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10' 
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40 

0 

39 

0 

98 
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40 

39 

98 

1 

10^ 

26 

0 

_ 

25 

0 

60 

0 

26 

25 

60 

*  Unit  is  one  dot  product  and  one  SAXPY  (vector  plus  a  scalar  times  a  vector). 


Tables.  Result oftests using 20x20 mesh,  n  =  722. 

To  illustrate  the  advantages  of  semi-orthogonality  we  evaluate  the  solution  for  the  4  x  1 6  beam  problem 
using  the  CXI  method,  Lanczos  method  with  full  reorthogonalization,  and  Lanczos  with  Simon’s  scheme  for 
maintaining  semi-orthogonality.  In  figure  2  we  plot  the  residual  norm  against  the  iteration  number  for  each 
of  the  methods.  The  results  for  the  two  implementations  of  the  Lanczos  method  are  indistinguishable.  The 
residual  norm  in  the  CG  method  starts  off  the  same  as  that  in  the  Lanczos  method,  but  it  deviates  quickly 
and  takes  four  times  as  many  steps  to  converge.  The  difference  in  the  curves  for  the  CG  and  Lanczos  is  due 
to  loss  of  orthogonality  in  the  CG  method. 
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We  use  the  20  x  20  Stokes  flow  problem  to  make  a  direct  comparison  betweoi  three  different  pre- 
conditioners;  diagonal  scaling,  the  ExE  Choleski  defined  in  (30),  and  the  ExE  LU  split  defined  in  (31). 
We  solve  this  problem  for  a  range  of  different  penalty  parameters  using  both  Lanczos  and  CG  methods.  A 
summary  of  the  results  is  given  in  Table  3.  Sample  plots  of  the  residual  noim  against  the  iteration  num¬ 
ber  are  illustrated  in  Figures  3  and  4.  The  number  of  iterations  required  to  obtain  the  solution  using  ExE 
LU  split  is  marginally  more  than  that  using  ExE  Choleski.  On  average  the  cost  of  reorthogonalization  for 
ExE  Choleski  was  slightly  less.  On  the  other  hand,  ExE  Choleski  requires  additional  storage  to  keep  the 
preconditioning  matrix.  It  is  interesting  to  note  that  the  number  of  reorthogonalizations  increases  with  the 
penalty  parameter  (condition  number).  This  indicates  that  Simon’s  scheme  performs  reorthogonalizations 
when  ever  it  is  needed.  The  number  of  iterations  given  in  Table  3  are  plotted  against  the  penalty  parameter, 
see  Ingure  S.  One  can  observe  from  this  plot  that  maintaining  orthogonality  can  results  in  reductions  of 
factors  of  two  in  the  number  of  iterations  for  this  example. 


i 

Lanczos  with  Partial  Reorthogonalization 
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K 
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Iter. 
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■ 
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41 

96 
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56 
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71 
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B 
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87 
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507 

H 
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151 
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H 
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2216 
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*  Unit  is  one  dot  product  and  one  SAXPY  (vector  plus  a  scalar  times  a  vector). 

Table  4.  Result  of  tests  using  the  beam  in  pure  bending.  +  indicates  that  CPU  time  exceeded. 


We  obtain  a  similar  set  of  results  for  the  beam  example;  see  Table  4.  Both  diagonal  and  ExE  LU  split 
are  used  to  precondition  the  problem.  The  solution  is  evaluated  for  three  different  levels  of  discretization 
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using  Lanczos  and  CG  methods.  Sample  plots  for  the  8  x  32  mesh  are  illustrated  in  Figure  6  for  thickness 
t  =  1.0  and  in  Ingure  7  for  t  =  0.1.  When  t  =  1.0,  CG  method  required  three  times  as  many  steps  to 
converge  as  the  Lanczos  method.  More  crucial  is  the  faa  that  CG  failed  to  converge  after  6000  iterations 
whent  <  0.5,  even  for  Ex  E  preconditioners.  On  the  other  hand  Lanczos  delivered  the  solution  in  less  than 
300  iterations  using  Ex E  preconditioners. 

The  16  X  64  beam  problem  was  also  analyzed  using  CG  method  with  SxS  preconditioner  using  t  =  1.0 
and  t  =  0.05  corresponding  to  element  aspect  ratios  1  and  80,  respectively.  The  mesh  is  partitioned  into  4, 
8, 16,  and  32  substructures.  All  the  partition  lines  were  through  the  thickness  of  the  beam.  These  partitions 
were  chosen  with  equal  number  of  elements  in  each  substructure  to  illustrate  the  performance  of  the  SxS 
solution  algorithm.  The  computations  were  carried  out  on  ahypercube  concurrent  computer  having  a  total  of 
32  processors.  Each  substructure  was  assigned  to  a  different  processor.  Thus,  when  the  number  of  partitions 
is  8,  only  8  of  the  processors  in  the  hypercube  is  utilized.  The  results  for  these  analysis  arc  given  in  Table  5. 


Beam 

Solution 

Elimination 

PCG 

No.  of 

Thickness 

Time  (Sec.) 

Time  (Sec.) 

Time  (Sec.) 

Iterations 

1.0 

4 

91 

82 

9 

8 

53 

37 

16 

16 

40 

14 

26 

148 

32 

43 

3 

40 

228 

0.05 

4 

110 

82 

28 

164 

8 

80 

37 

43 

16 

68 

14 

54 

32 

76 

3 

73 

410 

Table  S.  Result  of  CG  with  SxS  Preconditioner  for  the  beam  in  pure  bending. 

The  elimination  of  the  interior  unknows  in  each  substructure  was  carried  out  using  a  direct  method 
(profile  solver).  As  the  mesh  is  partitioned  into  more  substructures,  the  number  of  unknowns  in  the  interior 
decreases.  For  the  partitioning  scheme  used  for  this  problem,  the  bandwidth  of  the  substructure  stifihess 
matrix  does  not  change  with  the  number  partitions.  As  a  result,  the  parallel  time  for  eliminating  all  the 
interior  degrees  of  freedom  becomes  inversely  proportional  to  the  number  of  substructures.  Moreover,  the 
number  of  boundary  nodes  in  each  substructure  remains  constant.  A  direct  consequence  of  this  is  that  the 
total  number  of  interface  node  and  thus  the  number  of  equations  in  the  reduced  system  solved  by  PCG 
become  directly  proportional  to  the  number  of  substructure.  Then  the  parallel  time  for  each  PCG  iteration 
remains  constant.  For  the  SxS  preconditioners,  the  number  of  PCG  iterations  depends  on  the  number  of 
equations  and  for  this  choice  of  partitions,  the  total  parallel  PCG  time  is  proportional  to  the  square  root  of 
the  number  of  processors. 
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This  example  clear  indicates  that  there  is  an  optimum  number  of  processors  for  a  given  problems  that 
minimizes  the  total  solution  time  (direct  +  PCG).  For  the  beam  problem  this  minimum  occurs  for  16  pro¬ 
cessors.  This  optimiun  processor  number  is  expected  to  increase  with  the  problem  size. 
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8.  Conclusions 

The  conjugate  gradient  method  may  be  derived  directly  from  the  Lanczos  algotit^m  by  perfoiming  an 
implicit  triangular  factorization  of  the  resulting  reduced  ttidiagonal  matrix.  So  long  as  this  factorization  is 
numerically  stable  the  conjugate  gradient  will  also  be  stable.  For  symmetric  positive  definite  systems  this 
factorization  is  stable.  In  the  case  of  indefinite  equations  the  success  of  the  conjugate  gradient  method  can 
not  be  guaranteed.  However,  the  Lanczos  process  will  always  converge  for  all  symmetric  systems. 

Partial  reorthogonalization  improves  the  robusmess  of  the  Lanczos  algorithm  so  that  it  always  con¬ 
verges.  However,  for  ill  conditioned  system  the  cost  of  partial  rcorthogoruilization  can  be  substantial.  For 
well  conditioned  problems  where  there  is  no  tendency  towards  loss  of  orthogonality  among  the  Lanczos 
vectors,  partial  reorthogonalization  will  margirutlly  increase  the  cost  of  the  Lanczos  algorithm.  Therefore, 
partial  reorthogonalization  should  always  be  used  in  conjunction  with  good  preconditioners.  It  will  pick  up 
the  slack  for  preconditioners. 

The  S  X  S  preconditioner  is  introduced  as  an  extension  of  the  E  x  E  preconditioner.  It  is  a  hybrid  method 
combining  direct  elimination  of  degrees  of  freedom  interior  to  substructures  with  iterative  solution  for  the 
unknowns  on  the  boundary  tKxles  of  substructures.  The  S  x  S  preconditioners  are  always  more  effective  than 
the  Ex E  techniques. 
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Figure  2.  Effect  of  reorthogonalization  on  the  number  of  iterations;  4x16 
beam  with  t  =  1.0  and  diagonal  preconditioning. 
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Effect  of  preconditioning  on  the  number  of  Lanczos  steps  for  20  x  20 
mesh  with  penalty  parameter  of  10. 
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Figure  4.  Comparison  of  different  ExE  preconditioners  using  the  20x20 
mesh  with  penalty  parameter  of  10\ 
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Effect  of  illconditioning  due  to  element  properties  on  Lanczos  and 
CG  methods  (20x20  mesh). 
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Figure  6.  8  x  32  beam  problem  with  <  =  1 .0;  element  aspect  ratio  =  4 
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Figure  7.  8x32  beam  problem  with  <  =  0.1;  element  aspect  ratio  =  40. 


;ct  of  illconditioning  due  to  mesh  refinement  on  Lanczos  and 
methods  for  the  beam  problem. 
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Figure  9.  Effect  of  illconditioning  due  to  element  aspect  ratio  on  Lanczos 
and  CG  methods;  4x16  beam. 
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Figure  10.  Effect  of  illconditioning  due  to  element  aspect  ratio  on  Lanczos 
and  CG  methods;  16x64  beam. 
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